ABSTRACT


Recent advancements in the internet, social media, and internet of things
(IoT) devices have significantly increased the amount of data generated in a
variety of formats. The data must be converted into a format that is easily
handled by data analysis techniques. Applying machine learning algorithms to
large and complex data sets is computationally expensive: it is a
resource-intensive process that requires a large amount of logical and
physical resources. Machine learning is a sophisticated data analytics
technology that has gained importance as a result of the massive amount of
data generated daily that needs to be examined. The Apache Spark machine
learning library (MLlib) is one of the big data analysis platforms that
provides a variety of functions for machine learning tasks, ranging from
classification to regression and dimension reduction. From a computational
standpoint, this research investigates Apache Spark MLlib 2.0 as an
open-source, autonomous, scalable, and distributed learning library. Several
real-world machine learning experiments are carried out in order to evaluate
the properties of the platform on a qualitative and quantitative level. Some
of the fundamental concepts and approaches for developing a scalable data
model in a distributed environment are also discussed.


1. INTRODUCTION


The application of information technology in several fields, such as computing, networking, and
storage capacity, has seen significant advancements in the previous decade [1]-[6]. The result of this
advancement is the emergence of a new scientific paradigm: the era of huge data collection and exploration,
which has evolved into a scientific discovery approach that is on an equal footing with conventional
theoretical analysis, experimental design, and computer simulation. The amount of data generated and stored
has expanded dramatically over the last two decades as a result of the development of the internet of things
(IoT), artificial intelligence, cloud computing, and other cutting-edge computer technologies [7]-[19].
More than 6000 tweets are sent out per second on Twitter, and a similar trend can be found on
Facebook, WhatsApp, and other social media platforms. Social media and the Internet have thus made substantial
contributions to this rate of data generation [20]-[24]. Large amounts of data are also generated by
application and web servers from their logs at the organizational level, and other systems contribute to the
increased rate of data generation as well. As a result, data has evolved into an essential component of business
existence. Because of the increased level of digital data generation, combined with the increasing complexity
of the data, it has become impossible to process them using conventional data processing methods. As a
result, efforts are now being directed toward developing advanced computing infrastructures that can handle
data volumes and complexity of this magnitude (referred to as parallel and distributed processing) [25]-[42].

Managing vast amounts of data is a difficult task that demands the development of more complex
systems in order to achieve accurate and timely big data analysis [43]-[45]. In order to process big
data analytics problems in a timely and reliable manner, infrastructure for big data has been developed,
allowing for high-quality performance and resource availability for self-service and convenient on-demand
use. Numerous machine learning frameworks for big data analysis are presently available; they
are relevant to different scientific domains and have been shown to be useful in healthcare informatics,
genetic data analysis, text exploration, and random picture modeling, among other applications. The Apache
Spark machine learning library (MLlib) is an open-source, in-demand, and independent library for big data
analysis using machine learning techniques [46]. It has the advantage of automatic data balancing
and a distributed design, making it a good choice for big data analysis. Apache Spark provides a collection of
functions for numerous machine learning tasks, such as classification, regression, clustering, rule
extraction, and dimensionality reduction [47]-[51]. Although
numerous studies have been conducted on machine learning and its usefulness, ML libraries for big data
analysis, such as Apache Spark MLlib, have received little attention. This is perhaps the first study to look at
libraries for big data analysis that are based on machine learning techniques. Big data analytics is primarily
concerned with the advancement of computing infrastructures in such a way that data mining and analysis can
be completed quickly and efficiently [45], [52]-[56]. It is the primary driving force behind existing
businesses. Because big data analytics is a computationally intensive operation, the user experience during
big data analytics is influenced by the setup of different software and devices [57]-[66].

Several big data processing techniques have been proposed over the last decade due to the failure
of conventional processing methods to handle the large volume of data generated daily from business and
industrial processes [67]. As a result, researchers have been concentrating their efforts on developing more
effective methods of obtaining value-added information from large amounts of data. There are many different
types of studies in the area of big data processing models. One type covers data flow models such as
MapReduce, which facilitate data processing using a variety of operators while sharing data through stable
storage systems [68]-[74]. Resilient distributed datasets (RDDs) are a more efficient data
sharing abstraction than stable storage since they do not require data copying, which reduces cost.
Most high-level application programming interfaces (APIs) for data flow systems provide language-integrated
APIs [75]-[81], which allow the user to interact with "parallel collections" through operators such
as map and join. Parallel collections in these systems, however, represent either files on disk or the
temporary data sets used to express a query plan. Although these systems
can pass data between the operators of the same query, data sharing across queries has
proven to be inefficient. Spark's API is therefore built on the parallel collection model, which is
convenient to use. Spark does not claim to be the first to offer a language-integrated interface, but by
including RDDs as a storage layer behind this interface, it can support a larger range of applications.

A second category of systems provides high-level interfaces for specialized
applications that require data sharing. Pregel [82] supports iterative
graph applications, whilst Twister [83] and HaLoop [84] are iterative MapReduce runtimes.
These frameworks only provide data sharing for the computation patterns that they support, and
they do not provide a general abstraction that the user can employ to share data of their choice
among operations of their choice. For example, a user cannot load a dataset into memory using Pregel or Twister
and then decide which query to run on it. Because RDDs explicitly provide
a distributed storage abstraction, they can support applications that these
specialized systems do not cover, such as interactive data mining. Lunga et al. [85] proposed a framework that
improves on standard big data analytics methods that use either Hadoop/Spark or deep
learning as distinct components. The framework combines Spark's
distributed computing capabilities with a deep learning architecture, using cascade learning to train
multilayer perceptrons (MLPs). A framework for training deep
learning models with Apache Spark has been developed in [47], [48], [50], [51], [57], [69], [86]-[92].
This framework shortens the training time by exploiting both data parallelism and model
parallelism at the same time. Data parallelism is achieved by distributing the training data across the
machines of the Spark cluster and replicating the model on each machine [86]. Each model replica is trained in
parallel on its own part of the data. Model parallelism is implemented by distributing each replica of the deep
neural network model over the Spark cluster in a layer-by-layer fashion.
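
The data-parallel scheme described above can be sketched in a few lines of plain Python. This is only an
illustration under simplifying assumptions, not the framework of [86] and not a Spark API: the names
partition, local_gradient, and data_parallel_step are hypothetical, a one-parameter linear model stands in
for the neural network, and the per-partition work runs sequentially rather than on separate machines.

```python
def partition(data, num_workers):
    """Split the training data into num_workers roughly equal parts."""
    return [data[i::num_workers] for i in range(num_workers)]

def local_gradient(w, part):
    """Gradient of the mean squared error for y ~ w * x on one partition."""
    return sum(2 * (w * x - y) * x for x, y in part) / len(part)

def data_parallel_step(w, parts, lr):
    """Every worker holds a replica of w; the replicas are kept in sync
    by averaging the gradients computed on the individual partitions."""
    grads = [local_gradient(w, p) for p in parts]  # would run in parallel on a cluster
    return w - lr * sum(grads) / len(grads)

# Toy data generated with a true weight of 3.0 (illustrative values only).
data = [(x, 3.0 * x) for x in range(1, 9)]
parts = partition(data, num_workers=4)

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, parts, lr=0.01)
```

Averaging gradients is one common way to synchronize replicas; other schemes, such as averaging the model
parameters after several local steps, follow the same pattern.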

The impact of various software and hardware configurations on the problem of big data processing
is explored in this research. The focus is on the capabilities and advantages of Apache
Spark MLlib 2.0 as a big data analytics tool, particularly in relation to Hadoop. This study is intended to
provide insight into the usage of machine learning libraries in big data analysis from the
standpoint of industry. This work opens the door to other elements of big data analysis using machine
learning methods, which is regarded as a rapidly expanding research topic. Several real-world tests are carried
out to investigate the qualitative and quantitative aspects of Apache Spark MLlib 2.0. Moreover, a
comparative study is carried out using the massive online analytics (MOA) library, which is a well-known
Java-based machine learning library that is widely used in the industry. Furthermore, the performance of
several commonly used machine learning models for big data analysis is examined and compared across a
variety of software and hardware settings. The remainder of this article is arranged as follows: section 2
introduces Apache Spark MLlib. The method and components of the investigated Apache Spark MLlib 2.0
are presented in section 3, the results and discussion of the features and benchmarking are presented in
section 4, and the conclusion is presented in section 5.


2. APACHE SPARK MLLIB 2.0 

Apache Spark is a scalable and fast big data processing engine that was first developed by the AMPLab at the
University of California, Berkeley [93]-[95]. It may be used to construct distributed applications in a variety
of programming languages, including Java, Python, and others [96]-[105]. When it is
installed, it includes four major libraries: Apache Spark structured query language (SQL), Apache Spark
Streaming, Apache Spark MLlib, and Apache Spark GraphX. These libraries are described in more detail
below [106]-[108]. Apache Spark Streaming is a fault-tolerant module that performs high-level analytics on
streaming data, while Apache Spark SQL performs
relational queries for mining a variety of databases because it incorporates a data abstraction model known as
DataFrames [109]. Apache Spark GraphX [110] is a high-level Apache Spark
processing library that can handle two commonly used data structures using distributed computation models.
Apache Spark MLlib provides more than 55 scalable machine learning algorithms for big data analytics that take
advantage of both data and process parallelism. It enables the
implementation of numerous machine learning strategies, such as clustering, regression, classification,
rule extraction, and dimensionality reduction. It also enables the rapid and simple creation of machine learning
approaches for large-scale applications [67], [111]-[116].

A suite of multiple-language APIs is also available on the Apache Spark MLlib [117] platform for
the evaluation and deployment of a wide range of machine learning techniques. In recent years, several
changes have been made to multiple areas of data science solutions [118], [119], and a number of academics
have devoted attention to the creation of the components of Apache Spark MLlib for big data analytics.
Figure 1 depicts the development track of Apache Spark MLlib up to version 2.0, with a distinct number of anchors
assigned to each release of the library [120]-[122]. This section discusses some of the recent improvements
in Apache Spark MLlib applications, including some of the new features introduced. In order to aid in the
development of smart transportation applications, a scalable and open-source platform known as connected
vehicles and smart transportation (CVST) has been proposed by a number of researchers. The proposed
CVST is built of four essential components: data distribution, resource management, business intelligence,
and application. The business intelligence component is in charge of data analytics, and it makes use of
MLlib to process and transmit data to the front end. The studies in [107], [123],
[124] put forward an architectural design for academic information system services for student enrollment
pattern analysis. This system makes use of MLlib as a prediction tool to anticipate the suggested courses for the
forthcoming semester.

SparkText is a text mining framework developed by Ye et al. [125] that combines Apache Spark
machine learning and streaming algorithms with the Cassandra NoSQL database [90], [126]-[129]. The
database was built using a big collection of medical publications for the purpose of cancer type classification.
Aurora [130] demonstrated how to analyze web-sourced mobile data using the K-means algorithm of Apache Spark
MLlib. The study gave an effective technique for determining the
number of grid users based on the grouping of latitude and longitude information.
For learning human behaviors, the study in [122] presented ALMD, which
performs feature description by randomly monitoring appearance and movement, based on the
Apache Spark ML library. Assefi et al. [131] have described
the construction of a framework for demographics analysis using next-generation data sequencing as a
case study. In order to optimize the system, it is necessary to update the resource estimator and optimize the
components. The system was developed entirely on Apache Spark, and as a result, it takes advantage of the
favorable aspects of MLlib and other Spark components. BigNN was developed by Assefi et al. [131] as
another interesting application of big data analytics on Apache Spark. It is capable of handling biomedical strings
on a very large scale, which is very useful in the healthcare industry. MLlib can be used from
programs written in R, Scala, Python, and Java, among other programming languages. Vector, LabeledPoint,
and Rating are the core data abstractions used by MLlib; as a result, the machine learning and other statistical
components of MLlib work on data represented by these abstractions. The features of an observation are
captured using the Vector type, which represents an indexed collection of values of type Double with zero-based
indices of type Int.

A vector of length n can represent an observation with n features, which means that it
represents a point in an n-dimensional space. The Vector type offered by MLlib differs from the Vector
type supplied by the Scala collections library in that the Vector type in MLlib implements the numerical vector
concept from linear algebra, whereas the Vector type in the Scala collections library does not. MLlib is capable of
handling both dense and sparse vector types. In addition, because the MLlib Vector type is defined as a trait, it
cannot be instantiated directly by the application; instead, the factory methods provided by MLlib must be
used to construct an instance of either the sparse vector class or the dense vector class. The
factory methods for creating instances of the dense vector or sparse vector classes are already
defined in the Vectors object, which is convenient.
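
The difference between the dense and the sparse representation can be illustrated with a minimal
plain-Python sketch. It does not use MLlib; the helper names dense, sparse, and to_dense are hypothetical
and only play the role that factory methods play in a library: callers never build the underlying
structure by hand. A dense vector stores every entry, while a sparse vector stores its size together with
the zero-based indices and values of the nonzero entries.

```python
def dense(*values):
    """Factory for a dense vector: every entry is stored as a float."""
    return [float(v) for v in values]

def sparse(size, indices, values):
    """Factory for a sparse vector: only the nonzero entries are stored,
    as parallel lists of zero-based indices and values."""
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    if any(i < 0 or i >= size for i in indices):
        raise ValueError("index out of range")
    return (size, list(indices), [float(v) for v in values])

def to_dense(sparse_vector):
    """Expand a sparse vector into its dense form."""
    size, indices, values = sparse_vector
    out = [0.0] * size
    for i, v in zip(indices, values):
        out[i] = v
    return out

# The same observation with three features, in both representations.
d = dense(1.0, 0.0, 3.0)
s = sparse(3, [0, 2], [1.0, 3.0])
```

The sparse form pays off when most features are zero, since memory grows with the number of nonzero
entries rather than with the dimension.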


[Figure: the Spark ecosystem, comprising the SparkSQL, Streaming, MLlib, and GraphX libraries]

Figure 1. MLlib in Spark ecosystem





3. METHOD 

Spark's machine learning library, MLlib, has been under heavy development since its inception and,
unlike the Spark core, it is still not in a fully stable state with regard to its overall API and design. As of
Spark version 1.2.0, a new, experimental API for MLlib has been released under the ml package (whereas the
current library resides under the mllib package). Figure 1 shows the Spark ecosystem with MLlib. This new
API aims to enhance the APIs and interfaces for models as well as feature extraction and transformation so as
to make it easier to build pipelines that chain together steps that include feature extraction, normalization,
dataset transformations, model training, and cross-validation. Since the new API is still experimental, it may
be subject to major changes in the next few Spark releases. Over time, the various feature-processing
techniques and models covered here will simply be ported to the new API; however, the core concepts
and most underlying code will remain largely unchanged.

This section summarizes the tests carried out on the six datasets listed in Table 1. The findings are
reported in terms of the processing time for MLlib and MOA when both programs were run on the same
hardware. The performance of Apache Spark MLlib 2.0 was compared and evaluated on six distinct large
datasets obtained from the University of California, Irvine machine learning repository. The experimental
setup used in this work consisted of a standalone Spark cluster that uses an HDFS storage system
and Apache Zeppelin 0.7.1 as an editor, both of which were set up by the authors. The Spark cluster is
made up of the following components: a master node that runs the driver program; three worker nodes; and a
data node (including 1 worker node that executes on the master node). The three nodes had the same
configuration, as listed in Table 2. The three worker nodes each had a memory capacity of
48 GB, and each worker node was configured with four executors (each with a memory capacity of 4 GB)
and two CPUs. The worker in the master node was configured with three executors (each with a memory capacity of
5 GB) and two cores. A total of 16 GB of RAM was allocated to the driver process.
MLlib was run with Scala 2.11.8 on a Spark 2.2.1 cluster, with Hadoop 2.7.3 serving as the
distributed storage system, and the results were recorded. The amount of RAM available to the executors in
each worker node was adjusted, and the optimal number of data partitions was employed, in order to obtain the
fastest possible execution time. Table 1 describes the characteristics of the datasets that were used in this
investigation in terms of the number of attributes, records, and classes that they contained.


Table 1. Dataset description

Data             No. of records   No. of attributes   No. of classes
Covtype          581,012          54                  7
Covtype-2        581,012          54                  2
Higgs            11,000,000       28                  2
Botnet Attacks   7,062,606        115                 10
Dota2            102,944          116                 2
SUSY             5,000,000        18                  2


Table 2. System description

Parameter                 Specification
Operating system          Windows 10
CPU                       Intel® Core™ i7-6700 CPU @ 3.40 GHz with 8 logical cores
Memory                    16 GB
No. of workers            3
Computational framework   Apache Spark 2.2.1
Compatible framework      Radoop
DSS                       HDFS (Hadoop 2.7.3)
Code development editor   Apache Zeppelin 0.7.1
Coding language           Scala 2.11.8


4. RESULTS AND DISCUSSION


The implementation process started by defining the Spark context for the selected program.
The Spark context is the primary point of entry for Spark functionality, and it must be
created before attempting to create the RDDs. The three Spark context parameters, which are the application
name, the number of cores, and the URL of the cluster, were supplied in the configuration. The
name of the application should be meaningful in order to clearly identify the program's objective. To
run the application on a local cluster, the keyword “local” is used as the cluster URL. Worker nodes are
responsible for processing work in Spark, and the number of workers to be created
is dictated by the number of cores available. The next step is to train the model using the training data
and to provide the parameters of the selected supervised machine learning methods
(support vector machines (SVM), decision tree, and logistic regression). The parameters for the
decision tree, SVM, and logistic regression methods are shown in Tables 3, 4, and 5, respectively.
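
The relationship between the cluster URL and the number of cores can be made concrete with a small
plain-Python sketch. It assumes the local master URL forms local, local[K], and local[*]; the helper
local_threads, the settings dictionary, and the application name are hypothetical and are not part of Spark
or of the setup used in this study.

```python
import os
import re

def local_threads(master):
    """Return the number of worker threads implied by a local master URL,
    or None when the URL points at a remote cluster (e.g. spark://host:port)."""
    if master == "local":
        return 1                      # a single worker thread
    match = re.fullmatch(r"local\[(\d+|\*)\]", master)
    if match is None:
        return None                   # not a local master URL
    if match.group(1) == "*":
        return os.cpu_count()         # one thread per logical core
    return int(match.group(1))

# Hypothetical configuration: a descriptive application name and a local master URL.
settings = {"app_name": "covtype-decision-tree", "master": "local[4]"}
threads = local_threads(settings["master"])
```

Keeping the application name descriptive, as recommended above, makes it easier to tell jobs apart when
several of them share one cluster.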

The testing of the trained model on the testing set is the next step; this was accomplished using the 
“predict” method which was implemented using the “map” transformation of Spark for each row of the test 
set. The comparison of the computational time of Apache Spark MLlib and MOA under different 
experimental conditions is shown in Figure 2. 
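
The map-based testing step can be sketched in plain Python as follows. The sketch applies a predict function
to every row of a test set with map and collects (label, prediction) pairs. The linear model, its weights,
and the three test rows are hypothetical values chosen for illustration; they are not the models or data of
this study, and Python's built-in map stands in for Spark's distributed map transformation.

```python
def predict(weights, intercept, features, threshold=0.0):
    """Binary prediction of a linear model: 1 if the margin exceeds the threshold."""
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    return 1 if margin > threshold else 0

# Hypothetical trained parameters and a tiny test set of (label, features) rows.
weights, intercept = [0.5, -1.0], 0.25
test_set = [(1, [2.0, 0.5]), (0, [0.0, 1.0]), (1, [1.0, 0.0])]

# map: one (label, prediction) pair per row of the test set.
label_and_pred = list(
    map(lambda row: (row[0], predict(weights, intercept, row[1])), test_set)
)

# Fraction of rows whose prediction matches the label.
accuracy = sum(1 for y, p in label_and_pred if y == p) / len(label_and_pred)
```

Because each row is scored independently, the same function can be applied to the partitions of a
distributed dataset without any coordination between workers.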


Table 3. The parameters that were used in the decision tree classification algorithm

Parameter     Explanation                                                                        Value used
maxBins       Number of bins used for finding the splits at each node; the default value is 32   32
minInfoGain   Minimum information gain needed to create a split; the default value is 0.0        0.15
numClasses    Number of classes in the classification task                                       2
maxDepth      Maximum tree depth; the default value is 5                                         6
impurity      Criterion used to compute information gain (gini or entropy)                       entropy
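
How the impurity and minInfoGain parameters of Table 3 interact can be shown with a short plain-Python
sketch. It computes the entropy and Gini impurity of a set of labels and the information gain of a
candidate split, and then compares the gain with the threshold 0.15 from Table 3. The label lists are
hypothetical, and the functions are written from the textbook definitions rather than taken from MLlib.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of the class distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of the class distribution."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def info_gain(parent, left, right, impurity=entropy):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    children = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - children

# Hypothetical binary labels at a node and one candidate split of that node.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left, right = [0, 0, 0, 1], [0, 1, 1, 1]

gain = info_gain(parent, left, right)   # entropy-based gain
accept_split = gain >= 0.15             # threshold taken from minInfoGain in Table 3
```

Raising the threshold prunes weak splits and yields smaller trees; passing impurity=gini to info_gain
switches the criterion.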


Table 4. The parameters that were used in the SVM classification algorithm

Parameter      Explanation                                                Value used
validateData   Data must be validated by the algorithm before training    TRUE
iterations     The number of considered iterations                        1000
numClasses     The number of considered classes; the default value is 2   2


Table 5. The parameters that were used in the logistic regression algorithm

Parameter      Explanation                                                Value used
validateData   Data must be validated by the algorithm before training    TRUE
iterations     The number of considered iterations                        1000
numClasses     The number of considered classes; the default value is 2   2


[Figure: six panels, one per dataset (Covtype, Covtype-2, Higgs, Botnet attacks, Dota2, SUSY), each showing
MLlib and MOA under ENV1 and ENV2 for the DT, SVM, and LR classifiers]

Figure 2. Computational time of Apache Spark MLlib and MOA under different datasets:
(a) Covtype, (b) Covtype-2, (c) Higgs, (d) Botnet attacks, (e) Dota2, and (f) SUSY


The area under the ROC curve was closely similar for Apache Spark MLlib and MOA, as
the difference between them was not statistically significant. The small difference between them
could be due to the detailed parameter settings of each classifier and to the random selection of the test and
train datasets. Apache Spark MLlib was faster than MOA based on the observed computational
times of the classifiers; however, the clustering method showed statistically significant differences between
Apache Spark MLlib and MOA.
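
For readers unfamiliar with the metric, the area under the ROC curve can be computed with the rank-based
formulation sketched below in plain Python: it is the fraction of (positive, negative) pairs in which the
positive example receives the higher score, with ties counted as one half. The labels and the two score
lists are hypothetical and are not results of this study; roc_auc is an illustrative helper, not a function
of MLlib or MOA.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve as the probability that a randomly chosen
    positive example is scored above a randomly chosen negative example."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical test labels and the scores of two classifiers on the same rows.
labels = [1, 1, 0, 1, 0, 0]
scores_a = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
scores_b = [0.95, 0.7, 0.75, 0.65, 0.2, 0.3]

auc_a = roc_auc(labels, scores_a)
auc_b = roc_auc(labels, scores_b)
```

The pairwise loop is quadratic in the number of rows, so it suits small illustrations only; on large test
sets a sort-based computation is the usual choice.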


5. CONCLUSION 

Data generation has increased at an alarming rate in recent years, necessitating advancements in data
analytics and processing tools in order to enable the extraction of relevant information from vast amounts of
structured and unstructured data. Big data machine learning techniques, which are believed to be efficient in
pattern finding, can be used to handle this challenge more efficiently. Apache Spark MLlib is a widely used
machine learning library for big data, and it is a powerful tool for big data analytics. As shown in this study,
it provides excellent performance in terms of computational time. Massive online analytics (MOA), on the
other hand, is slightly slower than Apache Spark MLlib during big data analysis; however, because the
classifiers use different configurations and file systems (MLlib was
implemented on the Spark distributed file system, whereas the MOA classifier was implemented on the
Hadoop distributed file system), the comparison may not be appropriate. Although the aim was to demonstrate
how well Spark performs on large data sets using MOA as a benchmark, there are many
MOA features that Spark cannot compete with, such as the availability of a large pool of resources and
documents for MOA users, the ease with which non-experts can implement MOA, and the presence of a
good graphical user interface in MOA, among other things. These characteristics are the reason why MOA
supports a variety of machine learning techniques.